Applied Data Analysis in Python

Machine Learning

As explained at the beginning of the last chapter, machine learning is the name given to the tools, techniques and algorithms which are used to extract information from data.

There are two main classes of machine learning (each is illustrated with a short scikit-learn sketch after this list):

  1. Supervised: This is where you learn the relationship between some measurement of the data and some label of the data. Once the relationship is established, you can use it to predict what label to apply to new measurements. Supervised learning falls into two categories:

    1. classification, where the labels are discrete. For example, identifying the species of a flower from some measurements of its petals.
    2. regression, where the labels are continuous. For example, estimating the price of a house based on its number of rooms, the size of its garden, etc.
  2. Unsupervised: This is where you don't have any labels associated with the data and the algorithm needs to extract features of interest itself. Examples of this are:

    1. clustering
    2. dimensionality reduction
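To make these categories concrete, the sketch below names one scikit-learn estimator for each; the particular estimators chosen here are illustrative rather than recommendations.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Supervised estimators are fitted with both the measurements X and the labels y
classifier = LogisticRegression()  # classification: discrete labels, e.g. flower species
regressor = LinearRegression()     # regression: continuous labels, e.g. house prices

# Unsupervised estimators are fitted with X alone
clusterer = KMeans(n_clusters=3)   # clustering: group similar data points together
reducer = PCA(n_components=2)      # dimensionality reduction: compress the features
```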

In the last chapter we used a linear regression model. That approach was supervised, as we provided scikit-learn with both the measured quantities ($X$) and the expected output (or target), $y$. Since the target was a continuous variable, we were performing a regression.

Most of the time when people think of machine learning they are thinking of a supervised algorithm.

The supervised learning process

There is a standard process that you go through with most supervised learning problems. It is worth understanding because it affects how you should go about data collection, as well as the types of problems you can solve with this approach. For this explanation, let's pretend that we want to create a model which can predict the price of a house based on its age, the number of rooms it has and the size of its garden.

The process starts with the initial data collection. You go out into "the wild" and collect some data, $X$. This could be people's heights, the lengths of leaves on trees, or anything in your field which you consider easy to collect or measure. In our example we would measure a large number of houses' ages, room counts and garden sizes. We put these data into a table with one row per house and three columns, one for each feature.

Alongside that, you, as an expert, label each data point you collected with some extra data and call that $y$. In our case, $y$ is the price of the house. This is something which requires an expert's eye to assess, and we are trying to create a predictive model which can replace the job of the expert house-price surveyor.
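As a rough sketch of what this looks like in code, assuming pandas for the table and entirely made-up numbers for the houses:

```python
import pandas as pd

# Measurements X: one row per house, one column per feature (values are invented)
X = pd.DataFrame({
    "age_years": [12, 45, 3, 80, 25, 60, 7, 33],
    "rooms":     [4, 6, 3, 5, 5, 7, 2, 4],
    "garden_m2": [150, 300, 50, 200, 120, 400, 30, 180],
})

# Labels y: the expert-assessed price of each house, in the same order as the rows of X
y = pd.Series(
    [250_000, 420_000, 180_000, 310_000, 280_000, 510_000, 150_000, 260_000],
    name="price",
)
```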

Once you have collected your data, you need to choose which model best represents the relationship between $X$ and $y$. Perhaps a linear regression is sufficient, or maybe you need something more complex. There is no magic solution to knowing which model to use; it comes from experience and experimentation.

Using $X$ and $y$, we train (or "fit") our model to capture the relationship between the two (making sure to split the data into train and test subsets so that we can validate our model). After this point, the parameters of our model are fixed and can be detached from the data that were used to train it. It is now a "black box" which can make predictions.
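A minimal sketch of the fitting step, assuming we settle on a plain linear regression and reuse the hypothetical $X$ and $y$ from above; `train_test_split` holds back a quarter of the houses (its default) so the model can be checked on data it has never seen:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hold back some houses to validate the fitted model later
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

model = LinearRegression()
model.fit(train_X, train_y)        # the model's parameters are now fixed

# How well does the "black box" do on the unseen test houses? (R² for a regression)
print(model.score(test_X, test_y))
```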

Imagine now that a client comes to us and asks us to estimate how much they might be able to sell their house for. They tell us how old the house is, how many rooms it has and the size of its garden. We put these three numbers ($x^\prime$) into our model and it returns for us a prediction, $\hat{y}$.
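Continuing the sketch, the client's three numbers go into the fitted model via `predict` (the values here are made up):

```python
import pandas as pd

# The client's house: 30 years old, 5 rooms, 220 m² of garden (invented values)
x_prime = pd.DataFrame({"age_years": [30], "rooms": [5], "garden_m2": [220]})

y_hat = model.predict(x_prime)     # an array containing one predicted price
print(y_hat[0])
```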

Of course, $\hat{y}$ is just a prediction; it is not necessarily correct. There is usually a true value, $y$, that we hope we are close to. We can measure the quality of our model by seeing how close we get to the true value, subtracting one from the other: $y - \hat{y}$. This is called the residual.
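If we later learn what the house actually sold for, the residual is a one-line calculation, and the same idea applied to every held-back test house gives an overall measure of the model's quality (the sale price below is, again, made up):

```python
# Residual for the single prediction above
y_true = 350_000                       # the price the house actually sold for
print(y_true - y_hat[0])               # y - ŷ: positive means we under-predicted

# One residual per test house; their typical size summarises how good the model is
residuals = test_y - model.predict(test_X)
print(residuals.abs().mean())          # mean absolute error
```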